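LSTM
=================
A widely used recurrent neural network (RNN) module for processing data with temporal dependencies, such as speech, text, and time series. Each time step is computed as follows.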

.. math::

    \begin{aligned}
    i_t &= \sigma(W_{ii} x_t + W_{hi} h_{t-1} + b_i) && \text{(input gate)} \\[6pt]
    f_t &= \sigma(W_{if} x_t + W_{hf} h_{t-1} + b_f) && \text{(forget gate)} \\[6pt]
    g_t &= \tanh(W_{ig} x_t + W_{hg} h_{t-1} + b_g) && \text{(candidate state)} \\[6pt]
    o_t &= \sigma(W_{io} x_t + W_{ho} h_{t-1} + b_o) && \text{(output gate)} \\[6pt]
    c_t &= f_t \odot c_{t-1} + i_t \odot g_t && \text{(cell state update)} \\[6pt]
    h_t &= o_t \odot \tanh(c_t) && \text{(hidden state update)}
    \end{aligned}
    
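- :math:`x_t` : input vector at the current time step
- :math:`h_{t-1}` : hidden state of the previous time step
- :math:`c_{t-1}` : cell state of the previous time step
- :math:`i_t, f_t, g_t, o_t` : input gate, forget gate, candidate state, and output gate
- :math:`W_*` : the corresponding weight matrices
- :math:`b_*` : bias terms
- :math:`\sigma(\cdot)` : sigmoid function
- :math:`\odot` : element-wise multiplication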
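
The per-step equations above can be sketched as a plain C reference routine. This is an illustrative sketch only, not the library implementation: the names ``sigmoid_ref`` and ``lstm_step_ref``, the row-major weight layout with the gate blocks stacked in the order i, f, g, o, and the single combined bias vector are all assumptions made for the example.

.. code-block:: c

    #include <assert.h>
    #include <math.h>
    #include <stddef.h>

    /* Logistic sigmoid used by the input, forget, and output gates. */
    static float sigmoid_ref(float v) { return 1.0f / (1.0f + expf(-v)); }

    /*
     * One LSTM time step for a single sample (hypothetical reference helper).
     * Assumed layout: w_i is [4 * hidden][input], w_h is [4 * hidden][hidden],
     * bias is [4 * hidden] or NULL; gate blocks are stacked as i, f, g, o.
     * h and c hold h_{t-1}, c_{t-1} on entry and h_t, c_t on return.
     */
    static void lstm_step_ref(const float *x, const float *w_i, const float *w_h,
                              const float *bias, float *h, float *c,
                              float *gates /* scratch, 4 * hidden floats */,
                              int input_size, int hidden_size) {
        assert(input_size > 0 && hidden_size > 0);
        /* Pre-activations of all four gates: W_i x_t + W_h h_{t-1} + b. */
        for (int r = 0; r < 4 * hidden_size; ++r) {
            float acc = bias ? bias[r] : 0.0f;
            for (int k = 0; k < input_size; ++k)
                acc += w_i[(size_t)r * input_size + k] * x[k];
            for (int k = 0; k < hidden_size; ++k)
                acc += w_h[(size_t)r * hidden_size + k] * h[k];
            gates[r] = acc;
        }
        /* Apply the activations, then update the cell and hidden states. */
        for (int j = 0; j < hidden_size; ++j) {
            float i_t = sigmoid_ref(gates[j]);
            float f_t = sigmoid_ref(gates[hidden_size + j]);
            float g_t = tanhf(gates[2 * hidden_size + j]);
            float o_t = sigmoid_ref(gates[3 * hidden_size + j]);
            c[j] = f_t * c[j] + i_t * g_t;
            h[j] = o_t * tanhf(c[j]);
        }
    }

All pre-activations are computed before any state is overwritten, so every gate sees the same :math:`h_{t-1}`. Calling the routine once per time step, in order, yields the full sequence for one sample.
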


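Inputs:
    - **input** - input sequence data of shape :math:`(seq\_len, batch, input\_size)`, that is, the input features of every time step.
    - **weight_i** - weight matrices from the input to each gate (input, forget, cell, output), of size :math:`4 \times hidden\_size \times input\_size`.
    - **weight_h** - weight matrices from the previous hidden state to each gate, of size :math:`4 \times hidden\_size \times hidden\_size`.
    - **input_bias** - bias of the input part, covering the four gates (size :math:`4 \times hidden\_size`).
    - **state_bias** - bias of the hidden-state part (also :math:`4 \times hidden\_size`); it is added to input_bias to form the total bias.
    - **hidden_state** - on input, the initial hidden state of the current batch (:math:`h_0`); after execution it holds the hidden state of the last time step (:math:`h_t`).
    - **cell_state** - on input, the initial cell state of the current batch (:math:`c_0`); after execution it holds the cell state of the last time step (:math:`c_t`).
    - **buffer** - array of pointers to temporary workspaces (intermediate caches such as gate values, activation results, and temporary matrices, used to improve performance).
    - **LstmParameter** - LSTM configuration structure containing the input size, hidden size, sequence length, whether the LSTM is bidirectional, and related settings.
    - **core_mask** - core mask (shared-memory version only).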
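
As a quick sanity check when reserving memory, the element counts of the tensors listed above can be derived from the LSTM dimensions. The helper below is a hypothetical sketch: the type ``LstmTensorSizes`` and the function ``lstm_tensor_sizes`` are not part of the library. It assumes the unidirectional case without a projection layer, assumes the state tensors hold ``batch * hidden_size`` elements, and does not estimate the scratch ``buffer`` entries, whose required sizes are not specified here.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>

    /* Element counts (number of floats, not bytes) of the main LSTM tensors. */
    typedef struct LstmTensorSizes {
        size_t input;    /* seq_len * batch * input_size */
        size_t weight_i; /* 4 * hidden_size * input_size */
        size_t weight_h; /* 4 * hidden_size * hidden_size */
        size_t bias;     /* 4 * hidden_size, for each of input_bias and state_bias */
        size_t state;    /* batch * hidden_size, for each of hidden_state and cell_state (assumed) */
        size_t output;   /* seq_len * batch * hidden_size (no projection layer) */
    } LstmTensorSizes;

    /* Hypothetical helper: compute the element counts for a unidirectional LSTM. */
    static LstmTensorSizes lstm_tensor_sizes(int seq_len, int batch,
                                             int input_size, int hidden_size) {
        assert(seq_len > 0 && batch > 0 && input_size > 0 && hidden_size > 0);
        LstmTensorSizes s;
        s.input = (size_t)seq_len * batch * input_size;
        s.weight_i = (size_t)4 * hidden_size * input_size;
        s.weight_h = (size_t)4 * hidden_size * hidden_size;
        s.bias = (size_t)4 * hidden_size;
        s.state = (size_t)batch * hidden_size;
        s.output = (size_t)seq_len * batch * hidden_size;
        return s;
    }

With seq_len 4, batch 1, input_size 2, and hidden_size 3, this gives 8 input elements, 24 and 36 weight elements, and 12 elements per bias vector. Multiply by ``sizeof(float)`` to obtain byte sizes for fp32 data.
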

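**LstmParameter definition:**

.. code-block:: c
    :linenos:

    typedef struct LstmParameter {
        int input_size_;        // dimension of the input vector at each time step (number of input features)
        int hidden_size_;       // dimension of the LSTM hidden state (internal size of each gate)
        int project_size_;      // output dimension of the projection layer (LSTMP; if present, the hidden state is linearly compressed before output)
        int output_size_;       // actual output dimension: hidden_size_ or project_size_, depending on whether a projection layer is used
        int seq_len_;           // number of time steps in the input sequence (sequence length)
        int batch_;             // batch size (number of samples processed at once)
        // other parameter
        int output_step_;       // which time step's result is output (typically the last step or every step)
        bool bidirectional_;    // whether the LSTM is bidirectional (true: one forward and one backward layer)
        float zoneout_cell_;    // zoneout ratio of the cell state (regularization against overfitting)
        float zoneout_hidden_;  // zoneout ratio of the hidden state (regularization against overfitting)
        int input_row_align_;   // row alignment of the input tensor (memory alignment for DMA or SIMD acceleration)
        int input_col_align_;   // column alignment of the input tensor
        int state_row_align_;   // row alignment of the state tensors (hidden/cell)
        int state_col_align_;   // column alignment of the state tensors
        int proj_col_align_;    // column alignment of the projection-layer matrix
        bool has_bias_;         // whether bias terms are present (true: bias is used)
    } LstmParameter;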
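
Because the structure has many fields, it can help to start from a fully zeroed instance and then set the basic dimensions, so that no field is left indeterminate. The helper below is a hypothetical sketch, not a library function: the name ``lstm_param_init_basic`` is invented for illustration, the structure is repeated only so that the example compiles on its own, and zero-filling the alignment and zoneout fields is an assumption that should be checked against the library headers.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <string.h>

    /* Copy of the LstmParameter definition, repeated so the sketch is self-contained. */
    typedef struct LstmParameter {
        int input_size_;
        int hidden_size_;
        int project_size_;
        int output_size_;
        int seq_len_;
        int batch_;
        int output_step_;
        bool bidirectional_;
        float zoneout_cell_;
        float zoneout_hidden_;
        int input_row_align_;
        int input_col_align_;
        int state_row_align_;
        int state_col_align_;
        int proj_col_align_;
        bool has_bias_;
    } LstmParameter;

    /* Hypothetical helper: zero every field, then set the basic dimensions
     * of a unidirectional LSTM without a projection layer. */
    static void lstm_param_init_basic(LstmParameter *p, int seq_len, int batch,
                                      int input_size, int hidden_size,
                                      bool has_bias) {
        assert(p != NULL);
        assert(seq_len > 0 && batch > 0 && input_size > 0 && hidden_size > 0);
        memset(p, 0, sizeof(*p));       /* assumption: zero is acceptable for unset fields */
        p->seq_len_ = seq_len;
        p->batch_ = batch;
        p->input_size_ = input_size;
        p->hidden_size_ = hidden_size;
        p->output_size_ = hidden_size;  /* no projection: output_size_ equals hidden_size_ */
        p->bidirectional_ = false;
        p->has_bias_ = has_bias;
    }
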

Outputs:
    - **output** - address of the result buffer that holds the LSTM output of every time step; its shape is typically :math:`(seq\_len, batch, output\_size)`.

    Supported platforms:
        ``FT78NE``
        ``MT7004``

    .. note::
        - FT78NE supports fp32
        - MT7004 supports fp32
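
To read a particular element from an output buffer of shape :math:`(seq\_len, batch, output\_size)`, a flat offset can be computed as sketched below. The helper names ``lstm_output_offset`` and ``lstm_last_step`` are hypothetical, and the sketch assumes a contiguous row-major buffer with no alignment padding, which may not hold if the alignment fields of ``LstmParameter`` introduce padding.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>

    /* Flat offset of element (t, b, k) in a contiguous row-major
     * (seq_len, batch, output_size) buffer (layout is an assumption). */
    static size_t lstm_output_offset(int t, int b, int k,
                                     int seq_len, int batch, int output_size) {
        assert(t >= 0 && t < seq_len);
        assert(b >= 0 && b < batch);
        assert(k >= 0 && k < output_size);
        return ((size_t)t * batch + b) * output_size + k;
    }

    /* Pointer to the output vector of sample b at the last time step. */
    static const float *lstm_last_step(const float *output, int b,
                                       int seq_len, int batch, int output_size) {
        return output + lstm_output_offset(seq_len - 1, b, 0,
                                           seq_len, batch, output_size);
    }
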

**Shared-memory version:**

.. c:function:: void fp_Lstm_s(float *output, const float *input, const float *weight_i, const float *weight_h, const float *input_bias, const float *state_bias, float *hidden_state, float *cell_state, float *buffer[9], const LstmParameter *lstm_param, int core_mask)

**C usage example:**

    .. code-block:: c
        :linenos:
        :emphasize-lines: 40-42

        // FT78NE example
        #include <stdio.h>
        #include <lstm.h>

        int main(int argc, char* argv[]) {
            LstmParameter *lstm_param = (LstmParameter *)0x90000000;
            lstm_param->seq_len_ = 20;
            lstm_param->batch_ = 1;
            lstm_param->input_size_ = 2000;
            lstm_param->hidden_size_ = 3;
            lstm_param->bidirectional_ = false;
            float *input = (float *)0xA0000000;   // input resides in DDR space
            float *weight_i = (float *)0xA1000000;
            float *weight_h = (float *)0xA3000000;
            float *input_bias_ = (float *)0xB0900000;
            float *state_bias_ = (float *)0xB0B00000;
            float *output_s = (float *)0xC0000000;
            float *hidden_state_s = (float *)0xC0100000;
            float *cell_state_s = (float *)0xC0200000;
            float *buffer[9];                     // temporary workspaces
            float *packed_input_ = (float *)0xB0000000;
            buffer[0] = packed_input_;
            float *gate = (float *)0xB0100000;
            buffer[1] = gate;
            float *packed_state = (float *)0xB0200000;
            buffer[2] = packed_state;
            float *state_gate = (float *)0xB0300000;
            buffer[3] = state_gate;
            float *cell_buffer = (float *)0xB0400000;
            buffer[4] = cell_buffer;
            float *hidden_buffer = (float *)0xB0500000;
            buffer[5] = hidden_buffer;
            float *packed_output = (float *)0xB0600000;
            buffer[6] = packed_output;
            float *left_matrix = (float *)0xB0700000;
            buffer[7] = left_matrix;
            float *packed_ptr = (float *)0xB0800000;
            buffer[8] = packed_ptr;
            int core_mask = 0xff;
            fp_Lstm_s(output_s, input, weight_i, weight_h, input_bias_,
                      state_bias_, hidden_state_s, cell_state_s, buffer,
                      lstm_param, core_mask);
            return 0;
        }

**Private-memory version:**

.. c:function:: void fp_Lstm_p(float *output, const float *input, const float *weight_i, const float *weight_h, const float *input_bias, const float *state_bias, float *hidden_state, float *cell_state, float *buffer[9], const LstmParameter *lstm_param)

**C usage example:**

    .. code-block:: c
        :linenos:
        :emphasize-lines: 38-40

        // FT78NE example
        #include <stdio.h>
        #include <lstm.h>
        int main(int argc, char* argv[]) {
            LstmParameter *lstm_param = (LstmParameter *)0x10000000;
            lstm_param->seq_len_ = 4;
            lstm_param->batch_ = 1;
            lstm_param->input_size_ = 2;
            lstm_param->hidden_size_ = 3;
            lstm_param->bidirectional_ = false;
            float *input = (float *)0x10000200;   // input resides in DDR space
            float *weight_i = (float *)0x10000400;
            float *weight_h = (float *)0x10000600;
            float *input_bias_ = (float *)0x10000800;
            float *state_bias_ = (float *)0x10000A00;
            float *output_s = (float *)0x10000C00;
            float *hidden_state_s = (float *)0x10000E00;
            float *cell_state_s = (float *)0x10001000;
            float *buffer[9];                     // temporary workspaces
            float *packed_input_ = (float *)0x10001200;
            buffer[0] = packed_input_;
            float *gate = (float *)0x10001400;
            buffer[1] = gate;
            float *packed_state = (float *)0x10001600;
            buffer[2] = packed_state;
            float *state_gate = (float *)0x10001800;
            buffer[3] = state_gate;
            float *cell_buffer = (float *)0x10001A00;
            buffer[4] = cell_buffer;
            float *hidden_buffer = (float *)0x10001C00;
            buffer[5] = hidden_buffer;
            float *packed_output = (float *)0x10001F00;
            buffer[6] = packed_output;
            float *left_matrix = (float *)0x10002000;
            buffer[7] = left_matrix;
            float *packed_ptr = (float *)0x10002200;
            buffer[8] = packed_ptr;
            fp_Lstm_p(output_s, input, weight_i, weight_h, input_bias_,
                      state_bias_, hidden_state_s, cell_state_s, buffer,
                      lstm_param);
            return 0;
        }